Review

Applied problem: Merging samples

Repeated measures for 11 individuals, mean (sd)

Round  Duration   Number Correct
1      7.5 (2.0)  9.0 (3.3)
2      7.5 (2.0)  9.0 (3.3)
3      7.5 (2.0)  9.0 (3.3)
4      7.5 (2.0)  9.0 (3.3)

Applied problem: Merging samples

Regression of Duration on Number Correct repeated for each round

Round  Term         Estimate  SE
1      (Intercept)  3.0       1.12
       num.correct  0.5       0.12
2      (Intercept)  3.0       1.13
       num.correct  0.5       0.12
3      (Intercept)  3.0       1.12
       num.correct  0.5       0.12
4      (Intercept)  3.0       1.12
       num.correct  0.5       0.12
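
The near-identical summaries and regression fits can be reproduced in a few lines. This is a sketch only: it assumes the lecture data are equivalent to R's built-in anscombe data set (the values printed for test_data later in these slides agree with it). The object names long, fits and coefs are illustrative, not the original course code.

```r
# Assumption: the lecture's test_data matches the built-in `anscombe` data,
# reshaped to one row per respondent per round.
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    round       = factor(i),
    respondent  = factor(seq_len(nrow(anscombe))),
    num.correct = anscombe[[paste0("x", i)]],
    duration    = anscombe[[paste0("y", i)]]
  )
}))

# Mean and standard deviation of both measures within each round
aggregate(cbind(duration, num.correct) ~ round, data = long, FUN = mean)
aggregate(cbind(duration, num.correct) ~ round, data = long, FUN = sd)

# Fit duration ~ num.correct separately for each round
fits  <- lapply(split(long, long$round),
                function(d) lm(duration ~ num.correct, data = d))
coefs <- t(sapply(fits, coef))  # one row per round: intercept and slope
round(coefs, 2)
```

The four rounds are indistinguishable in these tables, which is why the next slides insist on plotting the data.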


Remember…

Always look at the data first


ALWAYS. LOOK. AT. THE. DATA.

Today

ggplot2

What is a graph?

A visual display that illustrates one or more relationships among numbers…a shorthand means of presenting information that would take many more words and numbers to describe.

—Stephen M. Kosslyn. Graph Design for the Eye and Mind. Oxford University Press, 2006

It depends on the goal:

At a minimum…

Psychological principles (Kosslyn, 2006)

Get their attention

  1. Relevance
    • Not too much or too little information
    • Present information that reflects the message you want to convey
    • Don’t present extraneous information
  2. Appropriate knowledge
    • Prior knowledge must be sufficient to understand the graph
    • If you assume too much prior knowledge, viewers will be confused
    • If you violate norms, viewers will be confused

If they are confused, they won’t try to understand your graph

Hold and direct their attention

  1. Salience
    • Attention is drawn to large perceptible differences
    • The most visually striking aspect receives the most attention
    • Annotations help direct viewers’ attention
  2. Discriminability
    • Properties must differ enough to be noticed
    • Defaults in ggplot2 do much of this work for you
  3. Organization
    • Groups of elements are seen and remembered as a whole

Try to anticipate the process the audience will go through while looking at your graph

Help them remember

  1. Compatibility
    • Form should be aligned with meaning
    • Lines express continuous change, bars discrete quantities
    • More = more (higher, better, bigger, etc.)
  2. Informative changes
    • Changes in properties should carry information
    • …and vice versa
  3. Capacity limitations
    • If too much information is presented, none is remembered
    • Four chunks in working memory
    • Graph designers err on the side of presenting too much; graph readers err on the side of paying too little attention

Decide what you want them to remember; everything else is secondary to that

ggplot2’s grammar


Layers




Test data

test_data
## # A tibble: 44 x 4
##    round respondent num.correct duration
##    <fct> <fct>            <dbl>    <dbl>
##  1 1     1                   10     8.04
##  2 1     2                    8     6.95
##  3 1     3                   13     7.58
##  4 1     4                    9     8.81
##  5 1     5                   11     8.33
##  6 1     6                   14     9.96
##  7 1     7                    6     7.24
##  8 1     8                    4     4.26
##  9 1     9                   12    10.8 
## 10 1     10                   7     4.82
## # … with 34 more rows

Defaults

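library(ggplot2)
# Full form: the data and mapping arguments are named explicitly
my_plot <- ggplot(data = test_data, mapping = aes(x = duration,
    y = num.correct))
# Equivalent short form: data and mapping are matched by position
my_plot <- ggplot(test_data, aes(x = duration, y = num.correct))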

An empty plot

print(my_plot)

Adding a layer

my_plot + geom_point()

Each layer has a geometry

my_plot + geom_point()
my_plot + geom_line()


my_plot + geom_point() + geom_line()

Each layer has a statistic

ggplot(test_data, aes(x = duration)) + geom_histogram(binwidth = 2)


Result of applying binning function to duration

## # A tibble: 44 x 4
##    round respondent num.correct duration
##    <fct> <fct>            <dbl>    <dbl>
##  1 1     1                   10     8.04
##  2 1     2                    8     6.95
##  3 1     3                   13     7.58
##  4 1     4                    9     8.81
##  5 1     5                   11     8.33
##  6 1     6                   14     9.96
##  7 1     7                    6     7.24
##  8 1     8                    4     4.26
##  9 1     9                   12    10.8 
## 10 1     10                   7     4.82
## # … with 34 more rows
## # A tibble: 5 x 2
##       x     y
##   <dbl> <dbl>
## 1     4     4
## 2     6    13
## 3     8    20
## 4    10     5
## 5    12     2
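
One way to inspect what a layer's statistic computed is layer_data(), which returns the data frame a layer draws from after its stat has run. The sketch below assumes the durations come from R's built-in anscombe data (columns y1 to y4 stacked); the object names durations, p and binned are illustrative.

```r
library(ggplot2)

# Assumption: durations equivalent to the slides' test_data, taken from
# the built-in anscombe data (y1..y4 stacked into a single column).
durations <- data.frame(
  duration = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

p <- ggplot(durations, aes(x = duration)) + geom_histogram(binwidth = 2)

# Data for layer 1 after stat_bin has been applied:
# x is the bin centre, count the number of observations in the bin
binned <- layer_data(p, 1)
binned[, c("x", "count", "xmin", "xmax")]
```

Every observation lands in exactly one bin, so the counts add up to the number of rows in the input.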

Geoms and statistics

Item            Default stat/geom
geom_point      stat_identity (\(f(x)=x\))
geom_line       stat_identity (\(f(x)=x\))
geom_histogram  stat_bin (binning)
geom_smooth     stat_smooth (regression)
stat_smooth     geom_smooth (line + ribbon)
stat_bin        geom_bar (vertical bars)
stat_identity   geom_point (dots)
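
Because each layer pairs a geom with a stat, either default can be overridden. The sketch below draws the same bin counts as bars and as points. It assumes the durations come from R's built-in anscombe data; the object names are illustrative.

```r
library(ggplot2)

# Assumption: durations equivalent to the slides' test_data
durations <- data.frame(
  duration = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)
base <- ggplot(durations, aes(x = duration))

# Default pairing: geom_histogram() bins the data and draws bars
bars <- base + geom_histogram(binwidth = 2)

# Same statistic, different geometry: bin counts drawn as points
dots <- base + stat_bin(binwidth = 2, geom = "point")

# The same layer again, starting from the geom and overriding its stat
dots2 <- base + geom_point(stat = "bin", binwidth = 2)
```

Printing bars, dots or dots2 renders the plot; the last two should look identical.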

Data versus statistics

ggplot(test_data, aes(x = round,
  y = duration)) + geom_point()

ggplot(test_data, aes(x = round,
  y = duration)) + geom_boxplot()
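
The two views can also be combined in one plot, with the summary as one layer and the raw observations as another. This is a sketch under the assumption that test_data is equivalent to the built-in anscombe data; the jitter width and transparency are arbitrary choices.

```r
library(ggplot2)

# Assumption: a stand-in for the slides' test_data built from anscombe
test_data <- data.frame(
  round    = factor(rep(1:4, each = 11)),
  duration = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

both <- ggplot(test_data, aes(x = round, y = duration)) +
  # Summary layer; outlier dots are hidden so no point is drawn twice
  geom_boxplot(outlier.shape = NA) +
  # Raw data layer, jittered horizontally only so durations stay exact
  geom_jitter(width = 0.1, height = 0, alpha = 0.6)

print(both)
```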

Aesthetics

Item             Required          Optional
geom_point       x, y              alpha, colour, fill, shape, size, stroke
geom_line        x, y              alpha, colour, linetype, size
geom_pointrange  x, y, ymin, ymax  alpha, colour, linetype, size

my_plot + geom_point(
  mapping = aes(colour = round))

my_plot + geom_point(
  colour="red")
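
The difference between the two calls above is mapping versus setting: inside aes() a value is treated as data and gets a scale and a legend, while outside aes() it is a fixed property of the layer. A common slip is to put a constant inside aes(). The sketch below assumes test_data is equivalent to the built-in anscombe data; the object names are illustrative.

```r
library(ggplot2)

# Assumption: a stand-in for the slides' test_data built from anscombe
test_data <- data.frame(
  round       = factor(rep(1:4, each = 11)),
  num.correct = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  duration    = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)
base <- ggplot(test_data, aes(x = duration, y = num.correct))

# Mapping: colour varies with a variable and gets a legend
mapped <- base + geom_point(aes(colour = round))

# Setting: every point is drawn in red, with no legend
fixed <- base + geom_point(colour = "red")

# Slip: "red" is treated as a one-level data value, so the points get the
# default palette colour and a legend entry labelled "red"
mistake <- base + geom_point(aes(colour = "red"))
```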

Position

g <- ggplot(test_data, aes(x = num.correct, fill = round))
g + stat_bin(binwidth = 4,
             position = 'stack')

g + stat_bin(binwidth = 4,
             position = 'dodge')

Practice with layers (Tasks 1–4)

Data

library(ggplot2) # or: library(tidyverse)
?mpg
Fuel economy data from 1999 and 2008 for 38 popular models of car

Description:
     This dataset contains a subset of the fuel economy data that the
     EPA makes available on http://fueleconomy.gov. It contains
     only models which had a new release every year between 1999 and
     2008 - this was used as a proxy for the popularity of the car.

Usage:
     mpg
     
Format:
     A data frame with 234 rows and 11 variables

     manufacturer
     model         model name
     displ         engine displacement, in litres
     year          year of manufacture
     cyl           number of cylinders
     trans         type of transmission
     drv           f = front-wheel drive, r = rear wheel drive, 4 = 4wd
     cty           city miles per gallon
     hwy           highway miles per gallon
     fl            fuel type
     class         "type" of car

mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

Task 0 (Example)


Do Tasks 1–4

Facets and discrete groups

g <- ggplot(mpg, aes(x = displ, y = hwy))
g + geom_point(aes(colour = drv))

g + geom_point() + facet_wrap(~drv)

Groups


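# cyl is an integer, so colour is a continuous scale and the data form
# a single group: one smooth for all points
ggplot(mpg, aes(x = displ, y = hwy,
              colour=cyl)) +
  geom_point() + geom_smooth()

# factor(cyl) is discrete, so each level becomes its own group with its
# own colour and its own smooth
ggplot(mpg, aes(x = displ, y = hwy,
              colour=factor(cyl))) +
  geom_point() + geom_smooth()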


  • To override the automatic grouping, specify aes(group=1) when creating a layer
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
    geom_point() + geom_smooth(aes(group = 1))

Scales

  • Scales apply to the entire plot, i.e., to every layer
  • ggplot2 can detect what type of scale you might want, but it isn’t perfect
  • For example, you might want a logarithmic scale instead of the default linear scale
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() +
    scale_y_log10(breaks = c(15, 30, 45))
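
To see that a scale applies to every layer, the sketch below adds a second layer and inspects the data each layer draws from. The transformation happens before statistics are computed, so the smooth is a straight-line fit to log10(hwy). The linear smooth and the object names are illustrative choices, not part of the original slides.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_y_log10(breaks = c(15, 30, 45))

# Both layers hold y on the log10 scale
points <- layer_data(p, 1)
smooth <- layer_data(p, 2)

range(points$y)        # same as range(log10(mpg$hwy))
range(smooth$y)        # fitted values, also in log10 units
```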

Labels

  • Always annotate graphs with a title and human-readable labels for each aesthetic
    • x- and y-axes
    • Legends and colour bars
ggplot(mpg, aes(x = displ,
                y = hwy,
                colour = drv)) +
 geom_point() +
 labs(x = "Displacement (litres)",
      y = "Highway miles per gallon",
      colour = "Drive train",
      title = "Automobile features")

Relabelling

library(dplyr)  # provides %>%, mutate() and case_when()
mpg2 <- mpg %>%
  mutate(drv2 = case_when(drv == 'f' ~ 'Front',
                          drv == '4' ~ '4WD',
                          drv == 'r' ~ 'Rear'))
ggplot(mpg2, aes(x = displ, y = hwy, colour = drv2)) + geom_point() +
  labs(colour = "Drive train")


ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() +
  facet_wrap(~ drv, labeller = as_labeller(c('f' = 'Front',
                                             'r' = 'Rear',
                                             '4' = '4WD')))

  • Another alternative is to use the forcats package to relabel/reorder factors
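
A minimal sketch of the forcats route, assuming the forcats package is installed. fct_recode() renames levels using new = "old" pairs and fct_relevel() sets their order, which also controls the order of legend entries and facets. The names mpg3 and drv2 are illustrative.

```r
library(ggplot2)
library(forcats)

# Rename the levels: new name on the left, existing value on the right
drv2 <- fct_recode(mpg$drv, Front = "f", Rear = "r", `4WD` = "4")

# Put the levels in the order they should appear in the legend
drv2 <- fct_relevel(drv2, "Front", "Rear", "4WD")

mpg3 <- mpg
mpg3$drv2 <- drv2

ggplot(mpg3, aes(x = displ, y = hwy, colour = drv2)) + geom_point() +
  labs(colour = "Drive train")
```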

Task 5

More reading